HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response

Selim Fekih, Nicolò Tamagnone, Benjamin Minixhofer, Ranjan Shrestha, Ximena Contla, Ewan Oglethorpe, Navid Rekabsaz

Categories: Natural Language Processing | Machine Learning

2022-10-10

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can benefit greatly from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises across the globe from 2018 to 2021. For each document, HumSet provides selected snippets (entries) as well as the classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of experiments on Pre-trained Language Models (PLMs) to establish strong baselines for future research in this domain. The dataset is available at https://blog.thedeep.io/humset/.


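In this paper, we introduce a new Universal Dependencies treebank for an endangered language from Amazonia: Kakataibo, a Panoan language spoken in Peru. We first discuss the collaborative methodology we implemented, which proved effective for creating a treebank in the context of an undergraduate computational linguistics course. We then describe the general details of the treebank and the language-specific considerations applied in the proposed annotation. Finally, we conduct some experiments on part-of-speech tagging and syntactic dependency parsing. We focus on monolingual and transfer learning settings, in which we study the impact of a Shipibo-Konibo treebank, a resource for another Panoan language.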
